Mining basic active structures from a large-scale database

Authors

  • Naoto Takada
  • Norihito Ohmori
  • Takashi Okada
Abstract

BACKGROUND The PubChem database is a large-scale resource for chemical information, containing millions of chemical compound activities derived by high-throughput screening (HTS). The ability to extract characteristic substructures from such enormous amounts of data is steadily growing in importance. Basic active structures (BASs) shared by compounds exhibiting G-protein coupled receptor (GPCR) activity and repeated dose toxicity have been mined from small datasets. However, the mining process employed was not applicable to large datasets owing to a large imbalance between the numbers of active and inactive compounds: in most datasets, one active compound appears for every 1000 inactive compounds, whereas most mining techniques work well only when these numbers are similar.

RESULTS This difficulty was overcome by sampling an equal number of active and inactive compounds. The sampling process was repeated to maintain the structural diversity of the inactive compounds. An interactive KNIME workflow that enabled effective sampling and data cleaning was created. Applying the cascade model, followed by structural refinement, yielded the BAS candidates. Repeated sampling increased the ratio of active compounds containing these substructures. Three samplings were deemed adequate to identify all of the meaningful BASs. BASs with similar structures were grouped to give the final set of BASs. This method was applied to HIV integrase and protease inhibitor activities in the MDL Drug Data Report (MDDR) database and to procaspase-3 activators in the PubChem BioAssay database, yielding 14, 12, and 18 BASs, respectively.

CONCLUSIONS The proposed mining scheme successfully extracted meaningful substructures from large datasets of chemical structures. The resulting BASs were deemed reasonable by an experienced medicinal chemist. The mining itself requires about 3 days to extract BASs with a given physiological activity. Thus, the method described herein is an effective way to analyze large HTS databases.
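The balanced, repeated sampling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' KNIME workflow. The function name `balanced_rounds`, the compound identifiers, and the choice to draw disjoint sets of inactive compounds in each round are assumptions made for illustration; the abstract states only that equal numbers of active and inactive compounds were sampled, and that three samplings were deemed adequate.

```python
import random


def balanced_rounds(active_ids, inactive_ids, n_rounds=3, seed=0):
    """Pair all actives with an equal-sized random draw of inactives, n_rounds times.

    Assumption (not from the source): draws are disjoint across rounds, so each
    round exposes the miner to different inactive compounds.
    """
    rng = random.Random(seed)
    pool = list(inactive_ids)
    rng.shuffle(pool)  # random order, then slice into disjoint chunks
    k = len(active_ids)
    if k * n_rounds > len(pool):
        raise ValueError("not enough inactive compounds for disjoint draws")
    return [(list(active_ids), pool[i * k:(i + 1) * k]) for i in range(n_rounds)]


# Hypothetical identifiers with the 1:1000 imbalance mentioned in the abstract.
actives = [f"A{i}" for i in range(5)]
inactives = [f"I{i}" for i in range(5000)]
rounds = balanced_rounds(actives, inactives)
```

Each element of `rounds` is a balanced dataset that could be passed to a substructure mining step. Drawing without overlap is one way to cover more of the structural diversity of the inactive compounds; sampling with replacement across rounds would be an equally plausible reading of the abstract.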

Journal title:

Volume 5, Issue

Pages -

Publication date: 2013